This is the website of the project “How successful your next R package will be? A prediction model using R packages features” by:
The website is a part of the final assignment in 140.712.01 Advanced Data Science II class of 2018/19 at JHSPH.
A two-minute narrated screencast demonstrating the project can be accessed by clicking the image below (equivalently, click here).
The project’s GitHub repository can be found here: https://martakarass.github.io/712-final_project/
In particular, the project’s GitHub repository contains:
final RMarkdown file: 2018-12-09-project-summary.Rmd (link)
corresponding compiled HTML file: 2018-12-09-project-summary.html (link)
How many times have you found yourself spending long, long hours wrapping up an R package, polishing it and pushing it to CRAN, only to realize afterwards that almost nobody downloads or uses it?
Hence, you may keep asking yourself:
Here we come with the “How successful your next R package will be?” modeling tool, which analyzes your package prototype based on its:
and predicts the number of downloads it will generate over time!
Develop a predictive model that takes a package’s features as input and predicts the number of downloads it will generate over (a) 3 months, (b) 1 year.
Identify which features of an R package, derived from its title and description text, metadata, code file content, attached data content, vignette content, etc., are associated with a high number of downloads.
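To make the two prediction targets concrete, here is a minimal Python sketch of how the download outcome could be computed from daily download counts. It is an illustration under stated assumptions, not the project's own code (the project is written in RMarkdown): the function name `downloads_in_window`, the 90-day and 365-day window lengths, and all numbers are hypothetical.

```python
from datetime import date, timedelta


def downloads_in_window(daily_counts, first_release, window_days):
    """Sum downloads from the first release date up to, but not including,
    first_release + window_days."""
    end = first_release + timedelta(days=window_days)
    return sum(n for day, n in daily_counts.items() if first_release <= day < end)


# Hypothetical daily download counts for one package (illustrative numbers only).
first_release = date(2017, 1, 1)
daily_counts = {first_release + timedelta(days=i): 2 for i in range(400)}

# Assumption: "3 months" is taken as 90 days and "1 year" as 365 days.
outcome_3_months = downloads_in_window(daily_counts, first_release, 90)
outcome_1_year = downloads_in_window(daily_counts, first_release, 365)
print(outcome_3_months, outcome_1_year)
```

Counting from the first release date keeps the outcome comparable across packages of different ages.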
Data sources we used include:
R packages’ description sites (from CRAN) information,
R packages’ archive files (from CRAN), to access information about the package from its 1st release version.
Exemplary data source: R package CRAN site
Examples of words (word stems) in features generated from a package’s title (left) and a package’s description (right):
Examples of words (word stems) in features generated from a package’s metadata, code files, attached data content, and vignettes:
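The word-stem features described above can be sketched as binary indicators: one feature per stem, set to 1 when the stem occurs in the text. The Python example below is a rough illustration only. The suffix-stripping rule in `crude_stem` is a toy stand-in for a real stemmer, and the vocabulary and package title are hypothetical; none of this is taken from the project's code.

```python
import re

# Toy suffix list (assumption); a real pipeline would use a proper stemmer.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_features(text, vocabulary):
    """Return {stem: 0 or 1} marking which vocabulary stems occur in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = {crude_stem(token) for token in tokens}
    return {stem: int(stem in stems) for stem in vocabulary}


vocabulary = ["ggplot", "interface", "model"]  # hypothetical stem vocabulary
title = "Interfaces for Bayesian Models"       # hypothetical package title
features = stem_features(title, vocabulary)
print(features)
```

The same function could be applied to the description, code files, or vignette text, producing one block of indicator columns per source.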
We have trained and tuned three different types of predictive models:
Prediction results on the test set for the outcome, the logarithm of the number of downloads over 1 year, are shown below.
The top variables in terms of:
from modeling the outcome, the logarithm of the number of downloads over 1 year, are shown below.
Random Forest and Support Vector Machine achieve the lowest MSE on our test dataset.
The top 5 variables (by variable importance rank) capture features related to a package’s title (“ggplot”), authors (number), unit testing (files), description (“interface”), and data files.
R package developers should include the top features highlighted in our results to increase the ‘success’ of their products.
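For readers who want to reproduce the kind of model comparison summarized above, the following Python sketch computes test-set MSE on the log-downloads scale and picks the model with the lowest value. It is an illustration under assumptions: the model names `model_a` and `model_b`, the helper `mse_log_downloads`, and every number are hypothetical and are not project results. It also assumes every package has at least one download, so the logarithm is defined.

```python
import math


def mse_log_downloads(actual_downloads, predicted_log_downloads):
    """Mean squared error between log(actual downloads) and predicted log-downloads."""
    errors = [
        (math.log(actual) - predicted) ** 2
        for actual, predicted in zip(actual_downloads, predicted_log_downloads)
    ]
    return sum(errors) / len(errors)


# Hypothetical test-set values (illustrative numbers only, not project results).
actual = [100, 1000, 10000]
predictions = {
    # model_a predicts every log-download exactly
    "model_a": [math.log(100), math.log(1000), math.log(10000)],
    # model_b is off by one log unit on two of the three packages
    "model_b": [math.log(100) + 1.0, math.log(1000), math.log(10000) - 1.0],
}

scores = {name: mse_log_downloads(actual, preds) for name, preds in predictions.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Working on the log scale means an error of one unit corresponds to being off by a factor of about 2.7 in raw downloads, regardless of how popular the package is.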